Abstract

We present a space- and time-efficient practical parallel algorithm for approximating
the diameter of massive weighted undirected graphs on distributed platforms supporting
a MapReduce-like abstraction. The core of the algorithm is a weighted graph decomposition
strategy generating disjoint clusters of bounded weighted radius. Theoretically, our
algorithm uses linear space and yields a polylogarithmic approximation guarantee; moreover,
for important practical classes of graphs, it runs in a number of rounds asymptotically
smaller than those required by the natural approximation provided by the state-of-the-art
$\Delta$-stepping SSSP algorithm, which is its only practical linear-space competitor in the
aforementioned computational scenario. We complement our theoretical findings with an
extensive experimental analysis on large benchmark graphs, which demonstrates that our
algorithm attains substantial improvements on a number of key performance indicators with
respect to the aforementioned competitor, while featuring a similar approximation ratio (a
small constant less than 1.4, as opposed to the polylogarithmic theoretical bound).

Keywords Graph Analytics; Parallel Graph Algorithms; Weighted Graph Decomposition; 
Weighted Diameter Approximation; MapReduce 


1 Introduction 

The analysis of very large (typically sparse) weighted graphs is becoming a central tool in
numerous domains, including geographic information systems, social sciences, computational
biology, computational linguistics, semantic search and knowledge discovery, and cyber security.
A fundamental primitive for graph analytics is the estimation of a graph’s diameter, defined
as the maximum weighted distance between nodes in the same connected component. For this
primitive, which is computationally intensive, resorting to parallelism is inevitable as the graph
size grows. Unfortunately, state-of-the-art parallel strategies for diameter estimation are either
space inefficient or incur long critical paths. These strategies are thus infeasible for dealing
with huge graphs, especially on distributed platforms characterized by limited local memory
and high communication costs (e.g., clusters of loosely-coupled commodity servers supporting a
MapReduce-like abstraction), which are widely used for big data tasks, and represent the target
computational scenario of this paper. In this setting, the challenge is to minimize the number of
communication rounds while using linear aggregate space and small (i.e., substantially sublinear)
space in the individual processors.

Previous work. For general graphs with arbitrary weights, the only known approach for the exact 
diameter computation requires the solution of the All-Pairs Shortest Paths (APSP) problem, but 
all known APSP algorithms are characterized by space and/or time requirements which make 
them impractical for very large graphs. For unweighted graphs, the HyperANF algorithm by
[BRV11] computes a tight approximation to the eccentricity of each node (i.e., its maximum
distance from every other node), which in turn yields a tight approximation to the graph diameter.
HyperANF allows an efficient multithreaded implementation and runs fast on a shared memory
platform. However, it has a critical path equal to the graph diameter and requires a (small)
nonconstant memory blow-up. Also, the algorithm cannot be adapted to deal with weighted graphs.
For these reasons, HyperANF is not a viable competitor for the objectives of the present work. 

We observe that an upper bound to the diameter within a factor two can be easily computed
by solving an instance of the Single-Source Shortest Path (SSSP) problem starting from an
arbitrary node. A number of recent works have explored the issue of reducing the approximation
factor by running several SSSP instances from carefully selected nodes [CGLM12, CLR+14].
While in practice a few such instances suffice to obtain a very good approximation, worst-case
approximation guarantees less than two are obtained at the expense of running a nonconstant
number of SSSP instances. In fact, there is strong theoretical evidence of the difficulty of
efficiently approximating the diameter within a factor less than 3/2 (see [CLR+14] and references
therein).
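The factor-two bound can be illustrated with a short sequential sketch. It is not the parallel strategy studied in this paper: it simply runs Dijkstra's algorithm from one node of a connected graph and uses the largest distance found (the eccentricity of the source) as a lower bound on the diameter and twice that value as an upper bound. The function names `sssp` and `diameter_bounds` and the adjacency-list layout are illustrative assumptions.

```python
import heapq

def sssp(adj, source):
    """Dijkstra's algorithm: shortest-path distances from source.

    adj maps each node to a list of (neighbor, weight) pairs with positive weights.
    """
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in adj[u]:
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

def diameter_bounds(adj, source):
    """Return (lower, upper) bounds on the diameter of a connected graph."""
    ecc = max(sssp(adj, source).values())  # eccentricity of the source
    return ecc, 2 * ecc

# Path graph a -1- b -4- c -2- d, whose true diameter is 7.
adj = {
    "a": [("b", 1)],
    "b": [("a", 1), ("c", 4)],
    "c": [("b", 4), ("d", 2)],
    "d": [("c", 2)],
}
lower, upper = diameter_bounds(adj, "b")
```

Starting from `"b"`, the eccentricity is 6, so the sketch reports the interval [6, 12], which contains the true diameter 7.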

A number of parallel SSSP algorithms have been devised in the last two decades [Coh00,
KS97, MS03] (a more extensive account of the literature can be found in [MS03]). To the best of
our knowledge, the most practical one is the $\Delta$-stepping PRAM algorithm proposed in [MS03].
The algorithm uses a design parameter $\Delta$ to trade off parallel time and work. In a nutshell,
the role of $\Delta$ is to stagger the computation of the SSSP tree into zones of bounded depth $\Delta$.
Small values of $\Delta$ reduce work at the expense of a longer critical path (i.e., parallel time), with
the algorithm approaching the behaviour of Dijkstra’s algorithm; the reverse tradeoff is obtained
when $\Delta$ increases, approaching in this case the behaviour of Bellman-Ford’s algorithm [CLRS09].
It can be shown that, irrespective of $\Delta$, if linear space is required, then the algorithm’s parallel
time is bounded from below by the unweighted diameter of the input graph. In fact, in [MS03]
the authors show that smaller parallel time can be achieved by introducing suitable shortcut
edges connecting all node pairs at distance at most $\Delta$, which, however, may result in
superlinear space complexity, especially for sparse graphs.

In [CPPU15] we devised a parallel $O(\log^3 n)$-approximation algorithm for computing the
diameter of unweighted undirected graphs. The algorithm first determines a partition of the 
graph into k = o(n) disjoint clusters through a decomposition strategy based on growing the 
clusters from batches of centers progressively selected from yet uncovered nodes, which yields a 
provably small cluster radius. The diameter of the graph is then obtained through the diameter 
of a suitable quotient graph associated with the decomposition. The algorithm can be efficiently 
implemented in a MapReduce-like environment yielding good approximation quality in practice. 
A similar clustering-based approach for diameter estimation had been introduced in [Mey08] in 
the external-memory setting. We remark that no analytical guarantees would be provided by the 
weight-oblivious execution of these algorithms on a weighted graph since, for a given topology, 
the system of shortest paths may radically change once weights are introduced. 

Our contribution. In this paper we address the more challenging scenario of weighted graphs, and
devise a diameter approximation strategy for distributed machines supporting a MapReduce-like
abstraction, with provable theoretical guarantees, providing extensive experimental evidence of
its efficiency and effectiveness.

Our algorithm adopts a cluster-based strategy similar to the one used in [CPPU15] for the
unweighted case. The main difficulty in the weighted case stems from the cluster growth process:
in the unweighted case we included all nodes in the frontier of all active clusters at each round;
in the weighted case we need to avoid traversing heavy edges, which would increase the cluster
radius and have an unpredictable impact on the approximation quality. On the other hand, a
high rate of cluster growth is crucial to enforce a small round complexity. In order to reconcile these
conflicting goals, we combine the following three main ingredients: (1) the progressive
cluster-growing strategy of [CPPU15]; (2) an upper bound $\Delta$ on the weight of the paths along which
clusters are grown (an idea inspired by the $\Delta$-stepping algorithm); and (3) a doubling strategy
to guess a quasi-optimal value of $\Delta$.

Analytically, we prove that with high probability our algorithm attains an $O(\log^3 n)$
approximation ratio using a number of rounds which is $O(\ell_R \log n)$, where $R$ is the maximum cluster
radius and $\ell_R$ is the number of edges required to connect any pair of nodes at distance at most
$R$. Both $R$ and $\ell_R$ are nonincreasing functions of the number $k$ of clusters. To obtain the best
round complexity, $k$ can be chosen as the maximum value for which the diameter of the quotient
graph can be computed efficiently in a single processor’s local memory in one round. The total
memory space used by the algorithm is linear in the input graph size. We also show that on
graphs of bounded doubling dimension [AGGM06] (an important family including, for example,
multidimensional arrays), under random edge weights and for some positive constants $\delta < \epsilon < 1$,
if the local memory available to each processor is $O(n^\epsilon)$ then the round complexity of our
algorithm can be made asymptotically smaller than the unweighted diameter of the graph by a factor
$O(n^\delta)$, outperforming by at least that factor the approximation strategy based on $\Delta$-stepping,
when linear space is required.

We complement the analytical results with an extensive set of experiments on several large
benchmark graphs (of up to about one billion nodes and several billion edges), both real and
synthetic, running on a 16-node cluster where the Spark engine [SPA] is employed to provide
a MapReduce environment. For all graphs, our algorithm attains an approximation ratio never
exceeding 1.4, which is far less than the theoretical upper bound and comparable with the one
ensured by the SSSP-based approach. Compared to an implementation of $\Delta$-stepping in Spark,
our algorithm is up to two orders of magnitude faster. The better performance of our algorithm
is also substantiated by more implementation-independent metrics such as the number of rounds
and the aggregate work counted as number of node updates and messages generated. We
performed scalability tests which, even in a somewhat small experimental testbed, demonstrate that
our algorithmic strategy has the potential to exploit a much higher degree of parallelism, thus
affording the analysis of much larger graphs.

Paper organization. Section 2 introduces some basic terminology and defines our reference
machine model. Section 3 illustrates the graph decomposition at the core of the diameter
approximation algorithm, which is described in the subsequent Section 4. The experimental analysis is
presented in Section 5.


2 Preliminaries 

Let $G = (V, E, w)$ be a connected undirected weighted graph with $n$ nodes (set $V$), $m$ edges
(set $E$), and a function $w$ which assigns a positive integral weight $w(e)$ to each edge $e \in E$.
We make the reasonable assumption that the edge weights are polynomial in $n$. (In fact, our
results immediately extend to the case of positive real-valued weights as long as the ratio between
maximum and minimum weight is polynomial in $n$.) The distance between two nodes $u, v \in V$,
for short $\mathrm{dist}(u, v)$, is the weight of a minimum-weight path between $u$ and $v$. The diameter
of the graph, which we denote by $\Phi(G)$, is the maximum distance between any two nodes. We
remark that while for convenience in the theoretical analysis we consider only connected graphs,
our results extend straightforwardly to disconnected graphs, by defining the diameter as the
largest distance between any two nodes in the same connected component.

Definition 1. For any positive integer $\tau \le n$, a $\tau$-clustering of $G$ is a partition
$\mathcal{C} = \{C_1, C_2, \ldots, C_\tau\}$ of $V$ into $\tau$ subsets called clusters. Each cluster $C_i$ has a
distinguished node $u_i \in C_i$ called center, and a radius $r(C_i) = \max_{v \in C_i} \{\mathrm{dist}(u_i, v)\}$.
The radius of a $\tau$-clustering $\mathcal{C}$ is $r(\mathcal{C}) = \max_{1 \le i \le \tau} \{r(C_i)\}$. Finally, denote by
$R_G(\tau)$ the minimum among the radii of all $\tau$-clusterings of $G$.

An important metric impacting the complexity of our algorithms is the number of edges
required to connect nodes with minimum-weight paths. More formally, for a given distance
parameter $\Delta$, we define $\ell_\Delta$ to be the minimum value such that for any two nodes $u, v \in V$ with
$\mathrm{dist}(u, v) \le \Delta$ there is a minimum-weight path between $u$ and $v$ with at most $\ell_\Delta$ edges. It is easy
to see that $\ell_\Delta$ is a nondecreasing function of $\Delta$ and that for any constant $c > 0$, $\ell_{c\Delta} = \Theta(\ell_\Delta)$.
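To make the edge-count metric concrete, the following brute-force sketch computes it on a small graph. For every source it runs Dijkstra's algorithm with lexicographic keys (distance, number of edges), which yields the fewest edges among all minimum-weight paths; the metric is then the largest such count over node pairs whose distance does not exceed the threshold. The names `min_hops_on_shortest_paths` and `ell`, and the choice of a non-strict threshold comparison, are assumptions made for illustration. The all-pairs computation is only meant for tiny inputs.

```python
import heapq

def min_hops_on_shortest_paths(adj, source):
    """Map each node to (distance, fewest edges on a minimum-weight path)."""
    best = {source: (0, 0)}
    heap = [(0, 0, source)]
    while heap:
        d, h, u = heapq.heappop(heap)
        if (d, h) > best[u]:
            continue  # stale heap entry
        for v, w in adj[u]:
            cand = (d + w, h + 1)
            # Lexicographic comparison: shorter distance first, then fewer edges.
            if cand < best.get(v, (float("inf"), 0)):
                best[v] = cand
                heapq.heappush(heap, (cand[0], cand[1], v))
    return best

def ell(adj, delta):
    """Largest edge count needed by any pair at distance at most delta."""
    result = 0
    for s in adj:
        for d, h in min_hops_on_shortest_paths(adj, s).values():
            if d <= delta:
                result = max(result, h)
    return result

# Edges: a-b (1), b-c (1), a-c (2), c-d (5).
adj = {
    "a": [("b", 1), ("c", 2)],
    "b": [("a", 1), ("c", 1)],
    "c": [("a", 2), ("b", 1), ("d", 5)],
    "d": [("c", 5)],
}
```

Here `ell(adj, 2)` is 1, because the pair (a, c) at distance 2 is joined by a single edge of weight 2 as well as by the two-edge path through b, whereas `ell(adj, 7)` is 2, because reaching d from a requires two edges.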

The performance of the algorithms presented in this paper will be analyzed considering their 
distributed implementation on the MR model of [PPR+12]. This model provides a rigorous 
computational framework based on the popular MapReduce paradigm [DG08], which is suitable 
for large-scale data processing on clusters of loosely-coupled commodity servers. Similar models 
have been recently proposed in [KSV10, GSZ11]. An MR algorithm executes as a sequence of 
rounds where, in a round, a multiset $X$ of key-value pairs is transformed into a new multiset $Y$ of
pairs by applying a given reducer function (simply called reducer) independently to each subset
of pairs of $X$ having the same key. The model features two parameters $M_T$ and $M_L$, where $M_T$
is the total memory available to the computation, and $M_L$ is the maximum amount of memory
locally available to each reducer. We use MR($M_T$, $M_L$) to denote a given instance of the model.
The complexity of an MR($M_T$, $M_L$) algorithm is defined as the number of rounds executed in
the worst case, and it is expressed as a function of the input size and of $M_T$ and $M_L$. In the
big data realm, practical MR($M_T$, $M_L$) algorithms should use total space linear in the input size
and significantly sublinear local space, while minimizing the round complexity.
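As an informal illustration of a single round, the sketch below groups a multiset of key-value pairs by key and applies a reducer to each group independently. It runs sequentially and ignores the memory parameters of the model, so it only conveys the data flow; `mr_round` and `min_reducer` are hypothetical names introduced here.

```python
from collections import defaultdict

def mr_round(pairs, reducer):
    """Simulate one round: group pairs by key, then apply the reducer per key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    output = []
    for key, values in groups.items():
        # Each call sees only the values of one key, as a reducer would.
        output.extend(reducer(key, values))
    return output

def min_reducer(key, values):
    """Keep the smallest value proposed for a key (e.g., a tentative distance)."""
    return [(key, min(values))]

pairs = [("v", 7), ("u", 3), ("v", 5), ("u", 9)]
result = sorted(mr_round(pairs, min_reducer))
```

With a reducer of this kind, a round can resolve competing distance updates addressed to the same node by keeping the smallest one.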

The following fact is proven in [GSZ11, PPR+12]. 

Fact 1. The sorting and (segmented) prefix-sum primitives for inputs of size $n$ can be performed
in $O(\log_{M_L} n)$ rounds in MR($M_T$, $M_L$) with $M_T = O(n)$.


3 Parallel graph decomposition 

In this section we present a parallel algorithm to partition a connected undirected weighted 
graph G = (V, E, w) into clusters of small radius. The algorithm generalizes the one presented in 
[CPPU15] for the unweighted case. Specifically, we grow clusters in stages, where in each stage a 
new randomly selected batch of cluster centers is added to the current clustering and the number 
of uncovered nodes halves with respect to the preceding stage. The challenge in the weighted 
case is to perform cluster growth by exploiting parallelism while, at the same time, limiting the 
weight of the edges considered in each growing step: the goal is to avoid increasing excessively 
the weighted radius of the clusters, which influences the approximation quality. In other words, 
unlike the strategy in [CPPU15] for the unweighted case, we cannot afford to boldly grow a cluster 
by adding all nodes connected to its frontier since some of these additions may entail heavy edges. 
To tackle this challenge, we use ideas akin to those employed in the $\Delta$-stepping parallel SSSP
algorithm proposed in [MS03]. In particular, we limit a cluster’s growth by imposing a threshold
$\Delta$ on its radius. Unlike the $\Delta$-stepping algorithm, however, this threshold is not fixed a priori
but automatically tuned to a quasi-optimal value during the clustering process.

Before presenting the algorithm, we introduce some technical details. Let $\Delta$ be an integral
parameter which we use as a guess for the radius of the clustering. As in [MS03], we call an edge
light if its weight is $\le \Delta$, and heavy otherwise. With each node $u \in V$ the algorithm maintains
a state consisting of two variables $(c_u, d_u)$: initially, $c_u$ is undefined and $d_u = \infty$. Whenever $u$
is assigned to a cluster, then $c_u$ is set to the cluster center and $d_u$ is set to an upper bound to
$\mathrm{dist}(c_u, u)$. In particular, if $u$ is chosen as a cluster center, then $c_u = u$ and $d_u = 0$. The algorithm
repeatedly applies, in parallel, edge relaxations of the kind used in the classical Bellman-Ford
algorithm. More precisely, we define the following $\Delta$-growing step: for each node $u$ with $d_u < \Delta$
and for each light edge $(u, v)$, in parallel, if $d_u + w(u, v) \le \Delta$ and $d_v > d_u + w(u, v)$ then the
status of $v$ is updated by setting $d_v = d_u + w(u, v)$ and $c_v = c_u$. In case more than one node
$u$ can provide an update to the status of $v$, the update that yields the smallest value of $d_v$ is
performed, with ties broken in favor of the node $u$ such that $c_u$ has the smallest index.
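The growing step described above can be sketched sequentially as follows. All proposals are computed from the state at the beginning of the step and applied together afterwards, which mimics the synchronous parallel semantics. The function names, the dictionary-based state, and the helper `grow_until_stable` (which simply repeats the step until nothing changes) are illustrative assumptions; the comparisons against the threshold follow one plausible reading of the strict and non-strict inequalities in the text.

```python
import math

def growing_step(adj, state, delta):
    """One synchronous growing step.

    adj maps node -> list of (neighbor, weight); state maps node -> (center, dist),
    with (None, math.inf) for unassigned nodes. Returns (new_state, changed).
    """
    proposals = {}
    for u, (cu, du) in state.items():
        if cu is None or du >= delta:
            continue  # only assigned nodes strictly inside the threshold propagate
        for v, w in adj[u]:
            if w > delta:
                continue  # heavy edge, never relaxed
            nd = du + w
            if nd <= delta and nd < state[v][1]:
                cand = (nd, cu)  # smaller distance first, then smaller center index
                if v not in proposals or cand < proposals[v]:
                    proposals[v] = cand
    new_state = dict(state)
    for v, (nd, c) in proposals.items():
        new_state[v] = (c, nd)
    return new_state, bool(proposals)

def grow_until_stable(adj, centers, delta):
    """Initialize the state from the centers and repeat steps until no update."""
    state = {u: ((u, 0) if u in centers else (None, math.inf)) for u in adj}
    changed = True
    while changed:
        state, changed = growing_step(adj, state, delta)
    return state

# Path 0 -2- 1 -2- 2 -9- 3 with a single center at node 0 and threshold 4.
adj = {0: [(1, 2)], 1: [(0, 2), (2, 2)], 2: [(1, 2), (3, 9)], 3: [(2, 9)]}
state = grow_until_stable(adj, {0}, 4)
```

In this example nodes 1 and 2 join the cluster of node 0 at distances 2 and 4, while node 3 stays unassigned because the edge of weight 9 is heavy.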

Suppose that a sequence of $\Delta$-growing steps is performed starting from some set of centers
$X \subseteq V$. After the execution of these steps, we can contract the graph as follows (Procedure
Contract). For each center $c \in X$, all nodes $u \in V$ with $c_u = c$ are removed, except $c$ itself.
For each edge $(u, v)$: if both $c_u$ and $c_v$ are defined, the edge is removed; if both $c_u$ and $c_v$ are
undefined, the edge is left unchanged; and if $c_u$ is defined and $c_v$ is undefined, the edge is replaced
by a new edge $(c_u, v)$ of weight $w(u, v)$.
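A minimal sequential sketch of the contraction follows. It takes an undirected edge list and the center assigned to each node (`None` when undefined) and returns the contracted node set and edge list. Keeping only the lightest of several parallel edges produced by the contraction is an assumption made here to keep the output simple; the description above does not prescribe it. The name `contract` and the data layout are likewise illustrative.

```python
def contract(edges, center):
    """Contract clusters into their centers.

    edges: iterable of undirected edges (u, v, w).
    center: dict node -> assigned center, or None if the node is uncovered.
    """
    # Covered nodes collapse into their center; uncovered nodes survive as they are.
    nodes = {u if c is None else c for u, c in center.items()}
    lightest = {}
    for u, v, w in edges:
        cu, cv = center[u], center[v]
        if cu is not None and cv is not None:
            continue  # both endpoints covered: the edge is dropped
        a = u if cu is None else cu
        b = v if cv is None else cv
        key = (a, b) if a <= b else (b, a)
        # Assumption: among parallel edges keep the one of minimum weight.
        if key not in lightest or w < lightest[key]:
            lightest[key] = w
    return nodes, [(a, b, w) for (a, b), w in sorted(lightest.items())]

# Path 0 -2- 1 -2- 2 -9- 3 -1- 4 where nodes 0, 1, 2 belong to the cluster of 0.
edges = [(0, 1, 2), (1, 2, 2), (2, 3, 9), (3, 4, 1)]
center = {0: 0, 1: 0, 2: 0, 3: None, 4: None}
nodes, new_edges = contract(edges, center)
```

The contracted graph has nodes {0, 3, 4}; the edge (2, 3) becomes (0, 3) with its original weight 9, and the edge (3, 4) is untouched.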

Algorithm CLUSTER($G$, $\tau$), whose pseudocode is given in Algorithm 1, grows clusters
progressively in a number of stages, until all the nodes of the graph are covered. In each stage, which
corresponds to an iteration of the outer while loop, a sequence of $\Delta$-growing steps is executed,
each with the goal of including into the current clusters at least half of the nodes that are still
uncovered but are reachable from some cluster through a path of weight at most $\Delta$. The set $X$ of
centers of the current clusters includes clusters partially grown in previous stages (if any), which
are now contracted and represented only by their centers (set $C_i$), and a set of $O(\tau \log n)$ centers
randomly selected from the uncovered nodes. In each stage, geometrically increasing values of
$\Delta$, starting from a suitable initial value, are guessed until the coverage goal can be attained.
(The sequences of $\Delta$-growing steps for the various guesses of $\Delta$ are performed in the inner while
loop through Procedure PartialGrowth.) When few nodes are left uncovered, these are added
as singleton clusters and the algorithm terminates. Observe that at most $O(\log n)$ stages are
executed and each contributes an additive factor $\Delta$ to the clustering radius. We will show below
that the largest guess for $\Delta$ will be $O(R_G(\tau))$, with high probability.
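The staged strategy just described can be prototyped sequentially as below. This is a rough sketch under several assumptions, not the distributed implementation: each uncovered node is sampled with probability $4 \ln 2 \cdot \tau \log_2 n$ divided by the number of uncovered nodes (an assumed form consistent with selecting $O(\tau \log n)$ centers per stage), growth runs until no state changes rather than stopping as soon as half of the uncovered nodes are reached, one center is forced if sampling selects none, and parallel edges created by contraction keep the minimum weight. The input graph is assumed connected, and the names `grow` and `cluster` are hypothetical.

```python
import math
import random

def grow(g, centers, delta):
    """Repeat growing steps until stable; returns node -> (center, dist)."""
    state = {u: ((u, 0) if u in centers else (None, math.inf)) for u in g}
    changed = True
    while changed:
        proposals = {}
        for u, (cu, du) in state.items():
            if cu is None or du >= delta:
                continue
            for v, w in g[u].items():
                nd = du + w
                if w <= delta and nd <= delta and nd < state[v][1]:
                    proposals[v] = min(proposals.get(v, (math.inf, None)), (nd, cu))
        for v, (nd, cu) in proposals.items():
            state[v] = (cu, nd)
        changed = bool(proposals)
    return state

def cluster(adj, tau, seed=0):
    """adj: node -> {neighbor: weight}. Returns original node -> cluster center."""
    rng = random.Random(seed)
    logn = math.log2(len(adj))
    gamma = 4 * math.log(2)
    delta = min(w for nbrs in adj.values() for w in nbrs.values())
    g = {u: dict(nbrs) for u, nbrs in adj.items()}  # current contracted graph
    centers, owner = set(), {}
    while len(g) - len(centers) > 8 * tau * logn:
        uncovered = [u for u in g if u not in centers]
        p = min(1.0, gamma * tau * logn / len(uncovered))  # assumed probability
        x = centers | {u for u in uncovered if rng.random() < p}
        if not x:
            x = {uncovered[0]}  # safeguard, not part of the described algorithm
        while True:  # doubling search for a threshold covering half the nodes
            state = grow(g, x, delta)
            reached = [u for u in uncovered if state[u][1] <= delta]
            if 2 * len(reached) >= len(uncovered):
                break
            delta *= 2
        for u in reached:
            owner[u] = state[u][0]
        # Contraction: covered nodes are represented by their center.
        rep = {u: (u if state[u][0] is None else state[u][0]) for u in g}
        new_g = {r: {} for r in set(rep.values())}
        for u, nbrs in g.items():
            for v, w in nbrs.items():
                if state[u][0] is None or state[v][0] is None:
                    a, b = rep[u], rep[v]
                    new_g[a][b] = min(new_g[a].get(b, math.inf), w)
        g, centers = new_g, x
    for u in g:
        if u not in centers:
            owner[u] = u  # leftover nodes become singleton clusters
    return owner

def grid(side):
    """Unit-weight square grid used as a small connected test graph."""
    adj = {(r, c): {} for r in range(side) for c in range(side)}
    for (r, c) in list(adj):
        for nb in ((r + 1, c), (r, c + 1)):
            if nb in adj:
                adj[(r, c)][nb] = 1
                adj[nb][(r, c)] = 1
    return adj

owner = cluster(grid(10), tau=1, seed=0)
```

On such a small graph the stopping threshold of the outer loop is reached after very few stages, so many nodes end up as singleton clusters; the sketch is meant to show the control flow rather than the asymptotic behaviour.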

Observe that when the algorithm terminates, each node $u \in V$ is assigned to a cluster centered
at some node $c_u$. Let $\Delta_{end}$ denote the value of $\Delta$ at the end of the execution of CLUSTER($G$, $\tau$).
In the following lemma, we show that with high probability $\Delta_{end}$ does not exceed $R_G(\tau)$ by more
than a constant factor.

Lemma 1. $\Delta_{end} = O(R_G(\tau))$ with high probability.

Proof. Consider an arbitrary iteration of the outer while loop and let $G_i = (V_i, E_i)$ be the
(contracted) graph on which cluster growth is performed during the iteration. Let $C_i \subseteq V_i$
be the nodes of $G_i$ representing clusters grown in previous iterations. Clearly, we have that
$|V_i - C_i| > 8\tau \log n$. Refer to the nodes of $V_i - C_i$ as uncovered nodes. We now show that, with
probability at least $1 - 1/n$, in $G_i$ at least half of the uncovered nodes can be reached by the
new centers selected in the iteration or by nodes of $C_i$ with paths of weight at most $2R_G(\tau)$
traversing only uncovered nodes.

Let $\mathcal{C}$ be a $\tau$-clustering of the whole graph with optimal radius $r(\mathcal{C}) = R_G(\tau)$. To avoid
confusion, we refer to its clusters as $\mathcal{C}$-clusters, while we simply call clusters those grown by our
Algorithm 1: CLUSTER($G$, $\tau$)

$\Delta \leftarrow \min\{w(u, v) : (u, v) \in E\}$; $\gamma \leftarrow 4 \ln 2$
$C_1 \leftarrow \emptyset$ /* current set of cluster centers */
$G_1(V_1, E_1) \leftarrow G(V, E)$
$i \leftarrow 1$
while $|V_i - C_i| > 8\tau \log n$ do
    Select $v \in V_i - C_i$ as a new center independently with probability $\gamma \tau \log n / |V_i - C_i|$
    $X \leftarrow C_i \cup \{\text{newly selected centers}\}$
    foreach $u \in V_i$ do
        if $u \in X$ then $(c_u, d_u) \leftarrow (u, 0)$ else $(c_u, d_u) \leftarrow (\mathrm{nil}, \infty)$
    $V' \leftarrow \emptyset$
    while $|V'| < |V_i - C_i|/2$ do
        PartialGrowth($G_i$, $\Delta$)
        $V' \leftarrow \{u \in V_i - C_i : d_u \le \Delta\}$
        if $|V'| < |V_i - C_i|/2$ then $\Delta \leftarrow 2\Delta$
    Assign each $u \in V'$ to the cluster centered at $c_u$
    $G_{i+1}(V_{i+1}, E_{i+1}) \leftarrow$ Contract($G_i$)
    $C_{i+1} \leftarrow X$
    $i \leftarrow i + 1$
Assign each $u \in V_i - C_i$ to a new singleton cluster centered at $u$
/* $V_i$ is the final set of cluster centers */

Procedure PartialGrowth($G$, $\Delta$)
    repeat
        perform a $\Delta$-growing step on $G$
        $V' \leftarrow \{u \in V(G) : d_u \le \Delta\}$
    until (no state is updated) or ($|V'| \ge |V(G)|/2$)


algorithm. Consider the $\mathcal{C}$-clusters that include some uncovered node. Among these $\mathcal{C}$-clusters,
those that contain less than $|V_i - C_i|/(2\tau)$ uncovered nodes account for a total of less than
$\tau |V_i - C_i|/(2\tau) = |V_i - C_i|/2$ such nodes. We call large the $\mathcal{C}$-clusters that contain $|V_i - C_i|/(2\tau)$
or more uncovered nodes. Therefore, large $\mathcal{C}$-clusters account for more than $|V_i - C_i|/2$ uncovered
nodes altogether. Let $C_\ell$ be any such large $\mathcal{C}$-cluster, with center $c_\ell$. By the choice of $\gamma$ in the
probability for center selection, we have that with probability $\ge 1 - 1/n^2$ at least one uncovered
node $c' \in C_\ell$ is selected as a new center. Since, for every uncovered node $v \in C_\ell$, there is a path in
$G$ from $c'$ to $v$ through $c_\ell$ of weight $w \le 2R_G(\tau)$, there must be a path in $G_i$ from $c'$ to $v$ of weight at
most $w$. Note that the suffix of this path starting from the last cluster center has weight
$\le w$ and traverses only uncovered nodes. The desired property follows by applying the union
bound over all large $\mathcal{C}$-clusters.

Since in PartialGrowth clusters are grown from the newly selected centers as well as from the
nodes of $C_i$, any value $\Delta \ge 2R_G(\tau)$ guarantees that half of the nodes in $V_i - C_i$ are covered by
clusters. Consequently, $\Delta$ can never be doubled beyond $4R_G(\tau)$. The lemma follows by applying
the union bound over all iterations. □
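The probability bound used for a single large cluster can be checked with a short calculation. The derivation below assumes that each uncovered node is selected independently with probability $p = \gamma \tau \log_2 n / |V_i - C_i|$ with $\gamma = 4 \ln 2$; this form of $p$ is an assumption chosen to be consistent with the stated value of $\gamma$ and with the $O(\tau \log n)$ expected centers per stage, and should be read as a sketch rather than as the authors' exact argument.

```latex
% Assumption: p = \gamma \tau \log_2 n / |V_i - C_i| with \gamma = 4 \ln 2,
% and a large cluster C_\ell holds at least |V_i - C_i| / (2\tau) uncovered nodes.
\begin{align*}
\Pr[\text{no new center in } C_\ell]
  &\le (1 - p)^{|V_i - C_i| / (2\tau)}
   \le \exp\!\left(-\frac{p\,|V_i - C_i|}{2\tau}\right) \\
  &= \exp\!\left(-\frac{\gamma}{2} \log_2 n\right)
   = \exp\!\left(-2 \ln 2 \cdot \log_2 n\right)
   = \exp(-2 \ln n)
   = \frac{1}{n^2}.
\end{align*}
% There are at most \tau \le n large clusters, so by the union bound
% every large cluster receives a new center with probability at least 1 - 1/n.
```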

The following theorem states the main result of this section. 

Theorem 1. Let $\tau$ be a positive integer. With high probability, CLUSTER($G$, $\tau$) returns an
$O(\tau \log^2 n)$-clustering of radius $O(R_G(\tau) \log n)$ by performing $O(\ell_{R_G(\tau)} \log n)$ $\Delta$-growing steps,
with $\Delta = O(R_G(\tau))$.

Proof. By Chernoff’s bound each iteration of the outer while loop selects $O(\tau \log n)$ new cluster
centers with high probability. Hence, the bound on the number of clusters follows by applying
the union bound over the $O(\log n)$ iterations of this loop. As for the bound on the clustering
radius, we observe that the aggregate number of iterations of the inner while loop is at most
$\log n + \log \Delta_{end}$, and this number is $O(\log n)$ with high probability by virtue of Lemma 1 and the
assumption on the polynomiality of the edge weights. The bound follows by observing that each
such iteration increases the radius of each cluster by an additive term at most $\Delta_{end} = O(R_G(\tau))$.

Finally, observe that in every iteration of the inner while loop the number of $\Delta$-growing
steps executed by procedure PartialGrowth is at most $\ell_{\Delta_{end}}$ since, by the properties of
edge relaxations, after that many growing steps all nodes at distance less than $\Delta$ from some
center of $X$ have been reached by the closest center in $X$ with a minimum-weight path, hence
their state cannot be further updated. Therefore, the final number of $\Delta$-growing steps will be
$O(\ell_{\Delta_{end}} \log n) = O(\ell_{R_G(\tau)} \log n)$, with high probability. □

4 Diameter Approximation 

In this section, we present an algorithm to estimate the diameter of a weighted graph as a 
function of the diameter of a suitable (much smaller) quotient graph of a clustering obtained 
through a refined version of the strategy devised in the previous section. As it will be clarified 
by the analysis, the refinement is introduced to achieve a provable bound on the approximation 
guarantee, by ensuring that not too many clusters have the potential to reach small
neighborhoods of the graph. Algorithm CLUSTER2($G$, $\tau$), whose pseudocode is given in Algorithm 2,
builds the required clustering by first computing the radius $R_{CL}(\tau)$ of the clustering returned by
CLUSTER($G$, $\tau$) and then executing $\log n$ iterations where uncovered nodes are selected as new
cluster centers with probability doubling at each iteration. In the $i$-th iteration, both previous
and new clusters are grown using $2R_{CL}(\tau)$-growing steps until all uncovered nodes at distance at
most $2R_{CL}(\tau)$ from them are reached (Procedure PartialGrowth2). At the end of the iteration,
the graph is contracted using Procedure Contract2, which is similar to Procedure Contract used
in CLUSTER with the only difference that each original edge $(u, v)$ of weight $w(u, v) \le 2R_{CL}(\tau)$
and such that $c_u$ is defined and $c_v$ is undefined, is replaced by a new edge $(c_u, v)$ with rescaled
weight $d_u + w(u, v) - 2R_{CL}(\tau)$. (In fact, original edges of weight greater than $2R_{CL}(\tau)$ are never
used by CLUSTER2.)

The following lemma analyzes the quality of the clustering returned by CLUSTER2($G$, $\tau$) and
upper bounds the number of growing steps performed.

Lemma 2. Let $\tau$ be a positive integer. With high probability, CLUSTER2($G$, $\tau$) computes an
$O(\tau \log^4 n)$-clustering of radius $R_{CL2}(\tau) = O(R_G(\tau) \log^2 n)$ by performing $O(\ell_{R_G(\tau) \log n} \log n)$
$\Delta$-growing steps with $\Delta = O(R_G(\tau) \log n)$.

Proof. By Theorem 1 we know that with high probability the invocation of CLUSTER($G$, $\tau$) at the
beginning of the algorithm computes a $K$-clustering, with $K = O(\tau \log^2 n)$, of radius $R_{CL}(\tau) =
O(R_G(\tau) \log n)$. In what follows, we condition on this event. Then, the bounds on the number
of growing steps and on $R_{CL2}(\tau)$ are straightforward. For $\gamma = 4/\log_2 e$, define $H$ as the smallest
integer such that $2^H/n \ge (\gamma K \log n)/n$, and let $t = \log n - H$. We now show that the number
of original nodes of $G$ not yet reached by any cluster decreases at least geometrically at each
iteration of the for loop after the $H$-th one. Recalling that $V_{H+i} - C_{H+i}$ is the set of original
nodes of $G$ that at the beginning of Iteration $H + i$ have not been reached by any cluster, for
Algorithm 2: CLUSTER2($G$, $\tau$)

Let $R_{CL}(\tau)$ be the radius of the clustering returned by CLUSTER($G$, $\tau$)
$C_1 \leftarrow \emptyset$ /* current set of cluster centers */
$G_1(V_1, E_1) \leftarrow G(V, E)$
for $i \leftarrow 1$ to $\log n$ do
    Select $v \in V_i - C_i$ as a new center independently with probability $2^i/n$
    $X \leftarrow C_i \cup \{\text{newly selected centers}\}$
    foreach $u \in V_i$ do
        if $u \in X$ then $(c_u, d_u) \leftarrow (u, 0)$ else $(c_u, d_u) \leftarrow (\mathrm{nil}, \infty)$
    PartialGrowth2($G_i$, $2R_{CL}(\tau)$)
    $G_{i+1}(V_{i+1}, E_{i+1}) \leftarrow$ Contract2($G_i$)
    $C_{i+1} \leftarrow X$

Procedure PartialGrowth2($G$, $\Delta$)
    repeat perform a $\Delta$-growing step on $G$ until no state is updated


$1 \le i \le t$, define the event $E_i$ = “at the beginning of Iteration $H + i$, $V_{H+i} - C_{H+i}$ contains at
most $n/2^{i-1}$ nodes”. We now prove that the event $\bigcap_{i=1}^{t} E_i$ occurs with high probability. Observe
that:

$$\Pr\left(\bigcap_{i=1}^{t} E_i\right) = \Pr(E_1) \prod_{i=1}^{t-1} \Pr(E_{i+1} \mid E_1 \cap \cdots \cap E_i)
= \prod_{i=1}^{t-1} \Pr(E_{i+1} \mid E_1 \cap \cdots \cap E_i),$$

since $E_1$ clearly holds with probability one. Consider an arbitrary $i$, with $1 \le i < t$, and assume
that $E_1 \cap \cdots \cap E_i$ holds. We prove that $E_{i+1}$ holds with high probability. Since $E_i$ holds, we
have that at the beginning of Iteration $H + i$, the number of nodes in $V_{H+i} - C_{H+i}$ is at most
$n/2^{i-1}$. Clearly, if $|V_{H+i} - C_{H+i}| \le n/2^i$ then $E_{i+1}$ trivially holds with probability one. Thus,
we consider only the case

$$\frac{n}{2^i} < |V_{H+i} - C_{H+i}| \le \frac{n}{2^{i-1}}.$$

In order to show that $E_{i+1}$ holds also in this case, we resort to the same argument used in the
proof of Lemma 1. Let $\mathcal{C}$ be a $K$-clustering of the whole graph with optimal radius $R_G(K)$ and
observe that $R_G(K) \le R_{CL}(\tau)$. To avoid confusion, we refer to its clusters as $\mathcal{C}$-clusters, while
we simply call clusters those grown by CLUSTER2. Consider the $\mathcal{C}$-clusters that include some nodes
of $V_{H+i} - C_{H+i}$, and call one such cluster large if it contains at least

$$\frac{|V_{H+i} - C_{H+i}|}{2K} > \frac{n \gamma \log n}{2^{H+i+1}}$$

nodes of $V_{H+i} - C_{H+i}$. This implies that the large $\mathcal{C}$-clusters contain, altogether, at least half of
the nodes of $V_{H+i} - C_{H+i}$. Moreover, by the choice of $\gamma$, it is easy to argue that with probability
$\ge 1 - 1/n$ at least one new center is selected from each large $\mathcal{C}$-cluster in Iteration $H + i$.
Consider now an arbitrary large $\mathcal{C}$-cluster $C_\ell$ centered at $c_\ell$, and let $c' \in C_\ell$ be a new center selected
from this cluster in the iteration. For every $v \in C_\ell \cap (V_{H+i} - C_{H+i})$ there is a path in $G$ from $c'$
to $v$ (through $c_\ell$) of weight $w \le R_{CL}(\tau)$, hence there must be a path in $G_i$ from $c'$ to $v$ of
weight at most $w$. It then follows that node $v$ will be covered by some cluster in Iteration $H + i$.
Consequently, in the iteration at least half of the nodes of $V_{H+i} - C_{H+i}$ will be covered by
clusters, with probability at least $1 - 1/n$.

By multiplying the probabilities of the $O(\log n)$ conditioned events, we conclude that event
$\bigcap_{i=1}^{t} E_i$ occurs with high probability. Note that in the last iteration (Iteration $H + t$) all uncovered
nodes are selected as centers with probability 1, and, if $\bigcap_{i=1}^{t} E_i$ occurs, these are $O(K \log n)$. Now,
one can easily show that, with high probability, in the first $H$ iterations, $O(K \log^2 n)$ clusters
are added and, by conditioning on $\bigcap_{i=1}^{t} E_i$, at the beginning of each Iteration $H + i$, $1 \le i \le t$,
$O(K \log n)$ new clusters are created, for a total of $O(K \log^2 n) = O(\tau \log^4 n)$ clusters. □

Observe that for fixed $\tau$, the clustering returned by CLUSTER2 has a larger number of
clusters and a weaker guarantee on its radius than the clustering returned by CLUSTER. As such,
CLUSTER2 does not appear to be a very desirable clustering strategy in itself. However, CLUSTER2
enforces the following important property which will be needed for proving the diameter
approximation. With reference to a specific execution of CLUSTER2, define the light distance between
two nodes $u$ and $v$ as the weight of the minimum-weight path from $u$ to $v$ consisting only of
edges of weight at most $2R_{CL}(\tau)$. (Note that the light distance is not necessarily defined for every
pair of nodes.)

Due to the weight rescaling performed by Contract2 at the end of each iteration, given a
center $c$ selected at a certain Iteration $i$ of the for loop, and a node $v$ at light distance $d$ from
$c$, the cluster centered at $c$ cannot grow to reach $v$ in fewer than $\lceil d/(2R_{CL}(\tau)) \rceil$ iterations, whereas
within that many iterations $v$ will be reached by some cluster (possibly the one centered at $c$).
Consequently, no center selected at a later iteration at the same distance from $v$ as $c$ would be
able to reach $v$.

We are now ready to present the main result of this section, which shows how CLUSTER2
can be employed to determine a good approximation to the graph diameter. Suppose we run
CLUSTER2 on a graph $G = (V, E, w)$ to obtain a clustering $\mathcal{C}$ of radius $R_{CL2}(\tau)$. For each $u \in V$,
let $c_u$ be the center of the cluster assigned to $u$, and let $d_u$ be the distance between $u$ and $c_u$ returned
by CLUSTER2. As in [Mey08], we define the weighted quotient graph associated to $\mathcal{C}$ as the graph
$G_{\mathcal{C}}$ where nodes correspond to clusters and, for each edge $(u, v)$ of $G$ with $c_u \ne c_v$, there is an
edge in $G_{\mathcal{C}}$ between the clusters of $u$ and $v$ with weight $w(u, v) + d_u + d_v$. (In case of multiple
edges between two clusters, it is sufficient to retain only the one yielding minimum weight.) Let
$\Phi(G)$ (resp., $\Phi(G_{\mathcal{C}})$) be the weighted diameter of $G$ (resp., $G_{\mathcal{C}}$). We approximate $\Phi(G)$ through
the value $\Phi_{approx}(G) = \Phi(G_{\mathcal{C}}) + 2R_{CL2}(\tau)$. It is easy to see that our estimate is conservative,
that is, $\Phi_{approx}(G) \ge \Phi(G)$. We have:
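The construction of the quotient graph and of the resulting estimate can be sketched as follows, given an edge list together with the center and the distance label of every node. The quotient diameter is computed here with the Floyd-Warshall recurrence, which is adequate only because the quotient graph is assumed small; the clustering radius is taken as the largest distance label. The function names and data layout are illustrative assumptions, and cluster centers are assumed to be mutually comparable identifiers.

```python
import math
from itertools import product

def quotient_graph(edges, center, dist):
    """Edges between distinct clusters, weighted w(u, v) + d_u + d_v (minimum kept)."""
    q = {}
    for u, v, w in edges:
        cu, cv = center[u], center[v]
        if cu == cv:
            continue  # intra-cluster edge
        key = (min(cu, cv), max(cu, cv))
        q[key] = min(q.get(key, math.inf), w + dist[u] + dist[v])
    return q

def weighted_diameter(nodes, q):
    """All-pairs shortest paths on the (small) quotient graph, then the maximum."""
    nodes = list(nodes)
    d = {(a, b): (0 if a == b else math.inf) for a, b in product(nodes, nodes)}
    for (a, b), w in q.items():
        d[a, b] = d[b, a] = min(d[a, b], w)
    for k, a, b in product(nodes, nodes, nodes):  # k is the outermost loop
        d[a, b] = min(d[a, b], d[a, k] + d[k, b])
    return max(d.values())

def estimate_diameter(edges, center, dist):
    """Quotient diameter plus twice the clustering radius."""
    q = quotient_graph(edges, center, dist)
    radius = max(dist.values())
    return weighted_diameter(set(center.values()), q) + 2 * radius

# Path 0 -1- 1 -1- 2 -3- 3 -1- 4 -1- 5 with clusters centered at nodes 1 and 4.
edges = [(0, 1, 1), (1, 2, 1), (2, 3, 3), (3, 4, 1), (4, 5, 1)]
center = {0: 1, 1: 1, 2: 1, 3: 4, 4: 4, 5: 4}
dist = {0: 1, 1: 0, 2: 1, 3: 1, 4: 0, 5: 1}
estimate = estimate_diameter(edges, center, dist)
```

In this example the single quotient edge has weight 3 + 1 + 1 = 5 and the radius is 1, so the estimate is 7, which here coincides with the true diameter of the path.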

Theorem 2. With high probability,

$$\Phi_{approx}(G) = O\left(\Phi(G) \log^3 n\right).$$

Proof. Since $R_{CL2}(\tau) = O(R_G(\tau) \log^2 n)$ (by Lemma 2), and $R_G(\tau) = O(\Phi(G))$, we have that
$R_{CL2}(\tau) = O(\Phi(G) \log^2 n)$. In order to show that $\Phi(G_{\mathcal{C}}) = O(\Phi(G) \log^3 n)$, let us fix an
arbitrary pair of cluster centers and an arbitrary minimum-weight path $\pi$ between them in $G$,
and let $w_\pi$ be the weight of $\pi$. Let $\pi_{\mathcal{C}}$ be the path of clusters in $G_{\mathcal{C}}$ traversed by $\pi$. We now
show that with high probability the weight of $\pi_{\mathcal{C}}$ in $G_{\mathcal{C}}$ is $O(\Phi(G) \log^3 n)$. The bound on the
approximation will then follow by applying the union bound over all pairs of cluster centers.
Consider first the case $2R_{CL}(\tau) \ge \Phi(G)$ (note that this can happen since the clustering yielding
the radius $R_{CL}(\tau)$ determined at the beginning of CLUSTER2 is built out of paths using only light
edges of weight $O(R_G(\tau))$). In this case, it is easy to see that the first batch of centers ever
selected in an iteration of the for loop of CLUSTER2 will cover the entire graph, and these centers
are $O(\log n)$ with high probability. Therefore, $\pi_{\mathcal{C}}$ contains $O(\log n)$ clusters and its weight is
O (w„ + Rcl 2 (t) logn) = O (<I>(G) log 3 n). Suppose now that 2 /? cl (t) < $(G). We show that at 
most O ({w v /Rcl(t)~\ log 2 n) clusters intersect tt (i.e., contain nodes of 7 r), with high probability. 
It can be seen that 7 r can be divided into O (|"w 7 r /i?cL('r)l) subpaths, where each subpath is 
either an edge of weight > Rcl(t) or a segment of weight < R C l(t). It is then sufficient to show 
that the nodes of each of the latter segments belong to O (log 2 n) clusters. Consider one such 
segment S. Clearly, all clusters containing nodes of S must have their centers at light distance 
at most Rcl 2 (t ) from S (i.e., light distance at most Rcli{t) from the closest node of S). Recall 
that Rcl 2 (r) < 2Rcl(t) logn. For 1 < j < 21og?z + 1, let C(S,j) be the set of nodes whose 
light distance from S is between ( j — 1 )Rcl(t) and jI?cl(t) — 1, and observe that any cluster 
intersecting S must be centered at a node belonging to one of the C(S,j)’ s. We claim that, 
with high probability, for any j, there are O (logn) clusters centered at nodes of C(S,j) which 
may intersect S. Fix an index j, with 1 < j < 2logn + 1, and let ij be the first iteration 
of the for loop of CLUSTER2 in which some center is selected from C(S,j). By the property of 
CLUSTER2 discussed after Lemma 2, [((j + 1)Rcl(t) — 1)/(2I?cl(t))] iterations are sufficient for 
any of these centers to cover the entire segment. On the other hand, any center from C(S,j) 
needs at least [ (j — 1 )Rcl ( t )/ (2-R C l( t ))] iterations to touch the segment. Hence, we have that no 
center selected from C(S,j) at Iteration ij + 2 or higher is able to reach S. It is easy to see that, 
due to the smooth growth of the center selection probabilities, the number of centers selected 
from C(S,j) in Iterations ij and ij + 1 is O(logn), with high probability. This implies that 
the nodes of segment S will belong to O (log 2 n) clusters, with high probability. By applying 
the union bound over all segments of 7 r, we have that O (\w^/R C l(t)] log 2 n) clusters intersect 
7 r, with high probability. Therefore, the weight of nc is O (w^ + RcL 2 (T)\iu n /R cl(t)~\ log 2 n) = 
0(<h(G) + RcL 2 (r)($(G)/I?cL(T))log 2 n) = O ($(G) log 3 n). □ 
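The estimate $\Phi_{approx}(G) = \Phi(G_C) + 2R_{CL2}(\tau)$ defined before Theorem 2 can be
illustrated with a small sequential sketch. This is not the MapReduce implementation: it assumes
the clustering is already available as two dictionaries (`center` and `dist`, hypothetical names),
takes the clustering radius to be the largest returned distance $d_u$, and computes the quotient
graph diameter exactly with one Dijkstra run per cluster, which is only viable for small inputs.

```python
import heapq
from collections import defaultdict


def quotient_graph(edges, center, dist):
    """Weighted quotient graph of a clustering.

    edges:  iterable of (u, v, w) undirected weighted edges of G
    center: dict mapping each node u to its cluster center c_u
    dist:   dict mapping each node u to d_u, its distance from c_u
    Returns {(c1, c2): weight} with c1 < c2, keeping only the
    minimum-weight edge between each pair of clusters.
    """
    q = {}
    for u, v, w in edges:
        cu, cv = center[u], center[v]
        if cu == cv:
            continue  # intra-cluster edges do not appear in G_C
        key = (min(cu, cv), max(cu, cv))
        weight = w + dist[u] + dist[v]
        if weight < q.get(key, float("inf")):
            q[key] = weight
    return q


def weighted_diameter(nodes, qedges):
    """Exact weighted diameter of a connected graph (one Dijkstra per node)."""
    adj = defaultdict(list)
    for (a, b), w in qedges.items():
        adj[a].append((b, w))
        adj[b].append((a, w))
    best = 0.0
    for s in nodes:
        d = {s: 0.0}
        heap = [(0.0, s)]
        while heap:
            du, u = heapq.heappop(heap)
            if du > d[u]:
                continue  # stale heap entry
            for v, w in adj[u]:
                nd = du + w
                if nd < d.get(v, float("inf")):
                    d[v] = nd
                    heapq.heappush(heap, (nd, v))
        best = max(best, max(d.values()))
    return best


def approx_diameter(edges, center, dist):
    """Phi(G_C) + 2 * radius, with radius assumed to be max_u d_u."""
    radius = max(dist.values())
    q = quotient_graph(edges, center, dist)
    return weighted_diameter(set(center.values()), q) + 2 * radius


# Path 0 -1- 1 -2- 2 -1- 3 split into clusters {0, 1} and {2, 3}.
edges = [(0, 1, 1), (1, 2, 2), (2, 3, 1)]
center = {0: 0, 1: 0, 2: 3, 3: 3}
dist = {0: 0, 1: 1, 2: 1, 3: 0}
estimate = approx_diameter(edges, center, dist)
```

On this toy path the true diameter is 4, while the estimate is 6, so the estimate is conservative
as claimed.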

4.1 Implementation in the $MR(M_T, M_L)$ model

We now discuss the implementation of the above diameter approximation algorithm in the MR
model using overall linear space and show that, for a relevant class of graphs, its round complexity
can be made asymptotically smaller than the one required to obtain a 2-approximation through
the state-of-the-art SSSP algorithm by [MS03]. For a given connected weighted graph
$G = (V, E, w)$, consider the $MR(M_T, M_L)$ model with total memory $M_T$ linear in the graph
size, and local memory $M_L \in O(n^\epsilon)$, for some constant $\epsilon \in (0,1)$. We begin by
observing that, regardless of the number of active clusters, a $\Delta$-growing step, for any
$\Delta$, can be implemented through a constant number of simple prefix and sorting operations
which, by Fact 1, require $O(1)$ rounds on $MR(M_T, M_L)$. By combining this observation with the
results of Theorem 1 and Lemma 2, we obtain that CLUSTER2($G$, $\tau$) can be implemented in
$O(\ell_{R_G(\tau) \log n} \log n)$ rounds in $MR(M_T, M_L)$.

For an arbitrary positive constant $\epsilon' < \epsilon$, let $\tau = \lceil n^{\epsilon'} \rceil$
and let $G_C = (V_C, E_C, w_C)$ be the weighted quotient graph associated with the clustering
returned by CLUSTER2($G$, $\tau$). The diameter $\Phi(G_C)$ (in fact, a constant approximation to
this quantity), and, consequently, the value $\Phi_{approx}(G)$, can then be computed in $O(1)$
rounds in $MR(M_T, M_L)$ by adopting the same techniques described in [CPPU15].

The following theorem summarizes the above discussion. 

Theorem 3. Let $G$ be a connected weighted graph with $n$ nodes, $m$ edges and weighted diameter
$\Phi(G)$. Also, let $\epsilon' < \epsilon \in (0,1)$ be two arbitrary constants, and let
$\tau = \lceil n^{\epsilon'} \rceil$. On the $MR(M_T, M_L)$ model, with $M_T = \Theta(m)$ and
$M_L = O(n^\epsilon)$, an upper bound $\Phi_{approx}(G) = O(\Phi(G) \log^3 n)$ to the diameter of
$G$ can be computed in $O(\ell_{R_G(\tau) \log n} \log n)$ rounds, with high probability.

Note that the round complexity depends on the characteristics of the graph and is nonincreasing
in the number of clusters, which is in turn controlled by parameter $\tau$. For general graphs, the
value $\ell_{R_G(\tau) \log n} \log n$ is $O(\ell_{\Phi(G)} \log n)$, while the analysis in [MS03]
implies that under the linear-space constraint a natural MR-implementation of $\Delta$-stepping
requires $\Omega(\ell_{\Phi(G)} \log n)$ rounds. Hence, our algorithm cannot be asymptotically
slower than $\Delta$-stepping. However, we show below that our algorithm becomes considerably
faster for an important class of graphs.

The following definition introduces a concept that a number of recent works have shown to 
be useful in relating algorithms’ performance to graph properties [AGGM06]. 

Definition 2. Consider an undirected graph $G = (V, E)$. The ball of radius $R$ centered at node
$v$ is the set of nodes reachable through paths of at most $R$ edges from $v$. Also, the doubling
dimension of $G$ is the smallest integer $b > 0$ such that for any $R > 0$, any ball of radius $2R$
can be covered by at most $2^b$ balls of radius $R$.
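To make Definition 2 concrete, the following sketch computes unweighted balls by breadth-first
search and counts how many balls of radius $R$ a greedy heuristic needs to cover a ball of radius
$2R$. The function names are hypothetical, and the greedy count is only an upper bound on the
size of an optimal cover, so it can overestimate $2^b$; it is meant as an illustration, not as a
procedure for computing the doubling dimension.

```python
from collections import deque


def ball(adj, v, radius):
    """Nodes reachable from v through paths of at most `radius` edges."""
    seen = {v: 0}
    queue = deque([v])
    while queue:
        u = queue.popleft()
        if seen[u] == radius:
            continue  # do not expand beyond the requested radius
        for x in adj[u]:
            if x not in seen:
                seen[x] = seen[u] + 1
                queue.append(x)
    return set(seen)


def greedy_cover_size(adj, v, radius):
    """Number of radius-`radius` balls used by a greedy cover of ball(v, 2*radius)."""
    uncovered = ball(adj, v, 2 * radius)
    count = 0
    while uncovered:
        # Pick the center whose ball covers the most still-uncovered nodes.
        best = max(adj, key=lambda c: len(ball(adj, c, radius) & uncovered))
        uncovered -= ball(adj, best, radius)
        count += 1
    return count


# Path graph on 9 nodes: 0 - 1 - ... - 8.
path = {i: [j for j in (i - 1, i + 1) if 0 <= j < 9] for i in range(9)}
cover = greedy_cover_size(path, 4, 1)
```

On the path, the ball of radius 2 around node 4 is covered by two balls of radius 1, which matches
the intuition that a linear array has doubling dimension 1.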

We can specialize the result of Theorem 3 as follows. 

Corollary 1. Let $G$ be a connected graph with $n$ nodes, $m$ edges, maximum degree $d \in O(1)$,
doubling dimension $b \in O(1)$, and positive integral edge weights chosen uniformly at random
from a polynomial range. Denote by $\Phi(G)$ and $\Psi(G)$, respectively, the weighted and
unweighted diameter of $G$. Also, let $\epsilon' < \epsilon \in (0,1)$ be two arbitrary constants.
On the $MR(M_T, M_L)$ model, with $M_T = \Theta(m)$ and $M_L = O(n^\epsilon)$, an upper bound
$\Phi_{approx}(G) = O(\Phi(G) \log^3 n)$ to the diameter of $G$ can be computed in

$$O\left( \left\lceil \frac{\Psi(G)}{n^{\epsilon'/b}} \right\rceil \log^3 n \right)$$

rounds, with high probability.

Proof. By iterating the definition of doubling dimension starting from a single ball of unweighted
radius $\Psi(G)$ containing the whole graph, we can decompose the graph into $\tau$ disjoint
clusters of unweighted radius $R = O(\lceil \Psi(G)/\tau^{1/b} \rceil)$. Letting $W$ be the maximum
edge weight, we have that $R \cdot W$ upper bounds $R_G(\tau)$. We know that our algorithm computes
the diameter approximation in $O(\ell_{R_G(\tau) \log n} \log n)$ rounds. We will now give an upper
bound on $\ell_{R_G(\tau) \log n}$. By using results from the theory of branching processes [Dwa69]
we can prove that by removing all edges of weight $\geq \lfloor W/d \rfloor$, with high probability
the graph becomes disconnected and each connected component has $O(\log n)$ nodes. (More details
will be provided in the full version of the paper.) As a consequence, with high probability any
simple path in $G$ will traverse an edge of weight $\geq \lfloor W/d \rfloor = \Omega(W)$ every
$O(\log n)$ nodes. This implies that a path of weight at most
$R_G(\tau) \log n = O(\lceil \Psi(G)/\tau^{1/b} \rceil W \log n)$ has
$\ell_{R_G(\tau) \log n} = O(\lceil \Psi(G)/\tau^{1/b} \rceil \log^2 n)$ edges. The corollary
follows by setting $\tau = \lceil n^{\epsilon'} \rceil$. □
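For readers who want the chain of bounds in one place, the derivation can be written out as
follows. This is our reading of the argument, under the assumption stated in the proof that a
heavy edge of weight $\Omega(W)$ occurs every $O(\log n)$ nodes along any simple path, so that a
path of weight $X$ has $O((X/W + 1) \log n)$ edges.

```latex
\begin{align*}
R_G(\tau) &\le R \cdot W
  = O\!\left( \left\lceil \Psi(G)/\tau^{1/b} \right\rceil W \right), \\
\ell_{R_G(\tau)\log n}
  &= O\!\left( \frac{R_G(\tau)\log n}{W} \cdot \log n \right)
  = O\!\left( \left\lceil \Psi(G)/\tau^{1/b} \right\rceil \log^2 n \right), \\
\text{rounds}
  &= O\!\left( \ell_{R_G(\tau)\log n} \log n \right)
  = O\!\left( \left\lceil \Psi(G)/n^{\epsilon'/b} \right\rceil \log^3 n \right)
  \qquad \text{for } \tau = \lceil n^{\epsilon'} \rceil .
\end{align*}
```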

As for the comparison with $\Delta$-stepping, the analysis in [MS03] implies that
under the linear-space constraint, for a graph $G$ with random uniform weights a natural
MR-implementation of $\Delta$-stepping requires $\Omega(\Psi(G))$ rounds. Thus, by the above
corollary, if $G$ has bounded doubling dimension the round complexity of our algorithm can be made
smaller by a sublinear yet polynomial factor which is a function of the available local space $M_L$.

We conclude this section by observing that in the case of very skewed graph topologies and/or
weight distributions under which the hypotheses of Corollary 1 do not hold, the factor
$\ell_{R_G(\tau) \log n}$ in the round complexity of our algorithm can be large, thus reducing the
competitive advantage with respect to $\Delta$-stepping. We can somewhat overcome this limitation
by imposing an upper limit $O(n/\tau)$ (resp., $O((n/\tau) \log n)$) to the number of growing steps
performed in each execution of PartialGrowth within CLUSTER($\tau$) (resp., PartialGrowth2 within
CLUSTER2($\tau$)). It can be shown that, in this case, the round complexity of our algorithm, for
general graphs, becomes $O(\min\{(n/\tau) \log n, \ell_{R_G(\tau) \log n}\} \log n)$ at the expense
of an extra $O(\lceil \ell_{R_G(\tau) \log n} / ((n/\tau) \log n) \rceil)$ factor in the
approximation ratio. The argument revolves around the existence of quasi-optimal clusterings of
bounded unweighted depth. (More details will be provided in the full version of this paper.)

Graph               n                                m                                $\Phi(G)$
roads-USA [DIM]     23,947,347                       29,166,673                       55,859,820
roads-CAL [DIM]     1,890,815                        2,328,872                        16,485,258
livejournal* [SNA]  3,997,962                        32,681,189                       9.41
twitter* [LAW]      41,652,230                       1,468,365,182                    9.07
mesh(S)*            $S^2$                            $2S(S-1)$                        †
R-MAT(S)* [CZF04]   $2^S$                            $16 \cdot 2^S$                   †
roads(S)            $\approx S \cdot 2.3 \cdot 10^7$ $\approx S \cdot 5.3 \cdot 10^7$ †

† the diameter depends on the size of the graph, controlled by S > 1.

Table 1: Benchmark graphs: $n$ is the number of nodes, $m$ the number of edges and $\Phi(G)$ is
the weighted diameter. Graphs marked with * have edge weights following a random uniform
distribution in $(0,1]$. The twitter graph, originally directed, has been symmetrized.


5 Experimental Analysis 

Our experimental platform is a cluster of 16 nodes, each equipped with a 4-core i7 processor
and 18GB RAM, connected by a 10Gbit Ethernet network. Our algorithms are implemented
using Apache Spark [SPA], a popular framework for big data computations that supports the
MapReduce abstraction adopted by our algorithms. The experiments have been run on several
graphs whose properties are summarized in Table 1 and can be classified as follows: a) road
networks (roads-USA and roads-CAL), b) social networks (livejournal and twitter), c) synthetic
graphs (mesh(S), R-MAT(S), and roads(S), where S is a parameter controlling the size
of the graph). The latter class contains artificially generated graphs whose size can be made
arbitrarily large and whose topological properties reflect those of the real networks in the first
two classes. In particular, R-MAT(S) are graphs with a power-law degree distribution and small
diameter [CZF04], and roads(S) are graphs obtained as the cartesian product of roads-USA with a
linear array of S > 1 nodes and unit edge weights. Finally, mesh(S) is an $S \times S$ square
mesh, included since it is a graph of known doubling dimension $b = 2$ for which the results of
Corollary 1 hold. All road networks come with their original integer weights, while for the other
graphs, which are originally unweighted, we assigned uniform random edge weights in $(0,1]$,
following the approach commonly adopted in the literature.
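As an illustration of the synthetic inputs, the sketch below generates the edges of mesh(S) with
uniform random weights in $(0,1]$. The function name, the row-major node numbering, and the fixed
default seed are assumptions made for this example and do not describe the generator actually used
in the experiments.

```python
import random


def mesh_edges(S, rng=None):
    """Edges (u, v, w) of an S x S square mesh with weights uniform in (0, 1]."""
    rng = rng or random.Random(0)  # fixed seed only for reproducibility here
    edges = []
    for r in range(S):
        for c in range(S):
            u = r * S + c  # row-major node id
            # random() lies in [0, 1), so 1 - random() lies in (0, 1].
            if c + 1 < S:  # horizontal neighbor
                edges.append((u, u + 1, 1.0 - rng.random()))
            if r + 1 < S:  # vertical neighbor
                edges.append((u, u + S, 1.0 - rng.random()))
    return edges


small_mesh = mesh_edges(4)
```

The edge count agrees with the $2S(S-1)$ entry of Table 1: for $S = 4$ the mesh has 24 edges.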

We implemented a simplified version of our diameter approximation algorithm, dubbed 
CL-DIAM, where, for efficiency, we used CLUSTER, rather than CLUSTER2, for computing the graph 
decomposition. In fact, CLUSTER2 first runs CLUSTER to obtain an estimate of the radius, and 
then computes a second decomposition which is instrumental to provide a theoretical bound to 
the approximation factor, but which does not seem to provide a significant improvement to the 
quality of the approximation in practice. 

As a second optimization, we ran CLUSTER using an initial value of $\Delta$ larger than the
minimum edge weight, which is the initial value specified in the pseudocode. We observe that by
increasing the initial value of $\Delta$, the round complexity improves since fewer doublings are
required before hitting the final value. On the other hand, setting the initial value of $\Delta$
too large may yield a larger cluster radius, possibly incurring a worse diameter approximation. To
explore this phenomenon, we experimented on mesh(S) with $S = 2048$ and random edge weights, such
that an edge has weight 1 with probability 0.1 and $10^{-6}$ otherwise. With high probability, such
a graph can be completely covered using clusters that do not contain edges with weight 1: including
one of those edges in a cluster would make its radius far bigger than it needs to be. We ran our
algorithm with two configurations. The first configuration started with $\Delta = 10^{-6}$ (i.e.,
the minimum edge weight) so as to let the algorithm tune itself to the final value of $\Delta$
($= 6.4 \cdot 10^{-5}$); the second configuration started with an initial $\Delta$ equal to the
graph diameter ($\approx 2.004$) so that no doubling of $\Delta$ was needed. The diameter
approximation obtained by the second configuration was about 2.5 times larger than the actual
diameter, whereas the first configuration obtained an approximation ratio of 1.0001. A set of
experiments (omitted here for brevity) showed that a good initial guess for $\Delta$ is the
average edge weight, which reduces the round complexity without affecting the approximation
quality significantly. Therefore, all our experiments have been run with this initial guess of
$\Delta$.
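The relation between the initial and final value of $\Delta$ in the mesh experiment can be checked
with a few lines of arithmetic. The helper below is a hypothetical illustration (it is not part of
CLUSTER) that counts how many doublings separate two values of $\Delta$.

```python
def doublings_needed(initial_delta, final_delta, rel_tol=1e-9):
    """Number of doublings needed to bring initial_delta up to final_delta."""
    count = 0
    delta = initial_delta
    # The tolerance guards against rounding in the decimal literals.
    while delta < final_delta * (1.0 - rel_tol):
        delta *= 2.0
        count += 1
    return count


first_config = doublings_needed(1e-6, 6.4e-5)   # start from the minimum edge weight
second_config = doublings_needed(2.004, 2.004)  # start from the graph diameter
```

Starting from $10^{-6}$, six doublings reach $6.4 \cdot 10^{-5}$, whereas the second configuration
needs none.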

Finally, in all of our experiments the parameter $\tau$ was set to yield a number of nodes in the
quotient graph < 100,000. This choice of $\tau$ was made to ensure that, for all instances, the
final diameter computation in the quotient graph would not dominate the running time.

The following paragraphs describe in detail the different sets of experiments that we performed.

Comparison with the SSSP-based approximation. Recall that an SSSP algorithm can be used to
yield a 2-approximation to the diameter by returning twice the weight of the heaviest shortest
path. Thus, we compared our algorithm CL-DIAM with a Spark implementation of the
$\Delta$-stepping SSSP algorithm (starting from a random node), which is the state of the art for
parallel SSSP and is in fact our only practical competitor on weighted graphs. In
$\Delta$-stepping, parameter $\Delta$ can be set to control the tradeoff between parallel time
(i.e., rounds in the MapReduce context) and total work. For each graph, we tested
$\Delta$-stepping with several values of $\Delta$, selecting the value yielding the best running
time. Since in MapReduce-like environments the number of rounds has a significant impact on the
running time, not surprisingly, for all graphs the best value of $\Delta$ was always the one
minimizing the number of rounds.
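The 2-approximation recalled above follows from the triangle inequality: if $e$ is the weight of
the heaviest shortest path from the source, then $e \leq \Phi(G) \leq 2e$ on a connected graph. The
sketch below illustrates this with sequential Dijkstra standing in for $\Delta$-stepping (a
deliberate substitution for brevity); the function names are hypothetical.

```python
import heapq


def sssp(adj, source):
    """Dijkstra's algorithm; adj maps node -> list of (neighbor, weight)."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in adj[u]:
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


def sssp_diameter_bounds(adj, source):
    """(lower, upper) bounds on the weighted diameter of a connected graph."""
    ecc = max(sssp(adj, source).values())  # heaviest shortest path from source
    return ecc, 2.0 * ecc


# Path 0 -1- 1 -2- 2, whose true weighted diameter is 3.
adj = {0: [(1, 1.0)], 1: [(0, 1.0), (2, 2.0)], 2: [(1, 2.0)]}
lower, upper = sssp_diameter_bounds(adj, 1)
```

Starting from the middle node, the bounds are 2 and 4, which bracket the true diameter of 3.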

The results of the comparison are summarized in Table 2 and graphically represented in
Figures 1, 2, and 3. In the table we report, for each graph, the diameter approximation factor
and the running time. Along with these, we also report two additional measures, namely, the
number of rounds and the work (defined as the sum of node updates and messages generated),
which allow the two algorithms to be compared in a more platform-independent way. It has to be
remarked that the approximation quality returned by CL-DIAM on all benchmark graphs, a value
always less than 1.4, is much better than the theoretical $O(\log^3 n)$ bound. Also, our algorithm
is up to about two orders of magnitude faster than $\Delta$-stepping (from about 2.5 times on
twitter to about 95 times on roads-USA, according to Table 2), while featuring comparable
approximation ratios (Figure 1). As expected, the higher performance of CL-DIAM is consistent
with the fact that it requires far fewer rounds than $\Delta$-stepping, as shown in Figure 2.
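The running-time gap can be quantified directly from Table 2. The snippet below copies the
running times (in seconds) from the table and computes, for each graph, the ratio between the
running time of $\Delta$-stepping and that of CL-DIAM; the variable names are ours.

```python
# (CL-DIAM seconds, Delta-stepping seconds), copied from Table 2.
times = {
    "roads-USA": (158, 14982),
    "roads-CAL": (13, 917),
    "mesh": (46, 1239),
    "livejournal": (19, 74),
    "twitter": (236, 601),
    "R-MAT(24)": (144, 1493),
}

# Speedup of CL-DIAM over Delta-stepping on each graph.
speedup = {graph: ds / cl for graph, (cl, ds) in times.items()}

slowest_gain = min(speedup, key=speedup.get)
largest_gain = max(speedup, key=speedup.get)

for graph, s in sorted(speedup.items(), key=lambda kv: kv[1]):
    print(f"{graph:12s} {s:6.1f}x")
```

The ratios range from roughly 2.5 on twitter to roughly 95 on roads-USA, with the road networks
and the mesh benefiting the most.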

As for the work, the better performance of CL-DIAM, as shown in Figure 3, is
mainly due to the smaller number of relaxations with respect to $\Delta$-stepping. Indeed, CL-DIAM
explores paths only up to a limited depth, whereas $\Delta$-stepping needs to run until all the
nodes are labeled with the optimal distance from the source. In fact, $\Delta$-stepping could limit
the amount of relaxations by using a smaller $\Delta$, but in doing so it would incur an increase
of the number of rounds, hence exhibiting worse performance. The gap between $\Delta$-stepping and
CL-DIAM in terms of both number of rounds and work suggests that our algorithm is likely to
remain competitive on other distributed-memory platforms employing programming frameworks
alternative to MapReduce.

We remark that the experiments reported in Table 2 involve graphs of moderate size for which,
not surprisingly, much better running times can be obtained on a single machine equipped with
sufficient main memory. In fact, the purpose of those experiments was not to attain best absolute
performance on the individual graphs but, rather, to compare the relative performance of CL-DIAM
with that of $\Delta$-stepping. Since it is reasonable to expect that the relative performance of
the two algorithms does not change as the graph size grows, performing this comparison on much
larger graphs would have only encumbered the experimental work without changing the overall
outcome. Nevertheless, we performed further experiments, reported below, to provide evidence
that our algorithm scales well with respect to machine and input size.

             approximation         time                  rounds                work
graph        CL-DIAM  Δ-stepping   CL-DIAM  Δ-stepping   CL-DIAM  Δ-stepping   CL-DIAM              Δ-stepping
roads-USA    1.26     1.09         158      14,982       74       11,268       $4.22 \cdot 10^{8}$  $1.35 \cdot 10^{11}$
roads-CAL    1.25     1.22         13       917          16       2,639        $1.70 \cdot 10^{7}$  $2.27 \cdot 10^{9}$
mesh         1.23     1.30         46       1,239        70       2,997        $1.13 \cdot 10^{8}$  $1.58 \cdot 10^{8}$
livejournal  1.22     1.10         19       74           7        42           $1.97 \cdot 10^{8}$  $4.81 \cdot 10^{8}$
twitter      1.19     1.32         236      601          5        35           $4.71 \cdot 10^{9}$  $1.20 \cdot 10^{10}$
R-MAT(24)    1.33     1.02         144      1,493        4        41           $7.45 \cdot 10^{8}$  $4.20 \cdot 10^{9}$

Table 2: CL-DIAM vs $\Delta$-stepping. For each benchmark, the table shows the running time (in
seconds), the approximation ratio, the number of rounds, and the work of the two algorithms.
The approximation ratio is expressed in terms of a lower bound to the true diameter computed
by running the sequential SSSP algorithm multiple times, each time starting from the farthest
node reached by the previous run.

Graph       time (seconds)
R-MAT(29)   6218
roads(32)   14054

Table 3: Experiments on big graphs

Scalability. To check the scalability of CL-DIAM with respect to the number of machines, we ran
it using 2, 4, 8, and 16 machines on R-MAT(26) and roads(3), which have approximately the
same number of nodes but topologies of different nature. The results are reported in Figure
4, which shows that, for both graphs, the algorithm exhibits excellent scalability.

Finally, we ran CL-DIAM on R-MAT(29) and roads(32), which are much larger graphs than
those employed in the experiments reported in Table 2, and for which the running time of
$\Delta$-stepping would be impractically high on our platform. The running times on 16 machines
are shown in Table 3. Considering that the size of R-MAT(29) (resp., roads(32)) is 32 (resp.,
about 57) times larger than the size of R-MAT(24) (resp., roads-USA), the experiment shows that
CL-DIAM's performance scales well with the graph size on the same machine configuration. A
slight penalty in the running time on large graphs is to be expected due to the increased number
of interactions with the disks occurring in each machine because of the large graph size.
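The size-scaling claim can be made concrete by dividing the growth in running time by the growth
in graph size, using the running times of Tables 2 and 3 and the size ratios quoted above. The
helper name is hypothetical, and a value of 1 would correspond to perfectly linear scaling.

```python
def time_growth_per_size_growth(small_time, large_time, size_ratio):
    """Ratio between running-time growth and input-size growth."""
    return (large_time / small_time) / size_ratio


# Running times in seconds from Tables 2 and 3; size ratios from the text.
rmat = time_growth_per_size_growth(144, 6218, 32)      # R-MAT(24) -> R-MAT(29)
roads = time_growth_per_size_growth(158, 14054, 57)    # roads-USA -> roads(32)
```

Both values are moderately above 1 (about 1.35 and 1.56), in line with the slight penalty
attributed to disk accesses on the larger graphs.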

Altogether, the experiments suggest that our algorithm can be effectively employed to provide
a good estimate of the diameter of huge graphs on sufficiently large clusters of commodity
processors.

Figure 1: Approximation ratio of CL-DIAM and $\Delta$-stepping.

Figure 2: Number of rounds required by CL-DIAM and $\Delta$-stepping. The scale is logarithmic.

Figure 3: Work performed by CL-DIAM and $\Delta$-stepping. The scale is logarithmic.

Figure 4: Scalability of CL-DIAM with respect to the number of machines.